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Abstract 

Background: Transcriptional regulation by transcription factor (TF) controls the time and abundance of mRNA 
transcription. Due to the limitation of current proteomics technologies, large scale measurements of protein level 
activities of TFs is usually infeasible, making computational reconstruction of transcriptional regulatory network a 
difficult task. 

Results: We proposed here a novel Bayesian non-negative factor model for TF mediated regulatory networks. 
Particularly, the non-negative TF activities and sample clustering effect are modeled as the factors from a Dirichlet 
process mixture of rectified Gaussian distributions, and the sparse regulatory coefficients are modeled as the 
loadings from a sparse distribution that constrains its sparsity using knowledge from database; meantime, a Gibbs 
sampling solution was developed to infer the underlying network structure and the unknown TF activities 
simultaneously. The developed approach has been applied to simulated system and breast cancer gene expression 
data. Result shows that, the proposed method was able to systematically uncover TF mediated transcriptional 
regulatory network structure, the regulatory coefficients, the TF protein level activities and the sample clustering 
effect. The regulation target prediction result is highly coordinated with the prior knowledge, and sample 
clustering result shows superior performance over previous molecular based clustering method. 

Conclusions: The results demonstrated the validity and effectiveness of the proposed approach in reconstructing 
transcriptional networks mediated by TFs through simulated systems and real data. 
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Background 

Transcription factor is one major gene regulator that 
governs the response of cells to changing endogenous or 
exogenous conditions [1]. Understanding how transcrip- 
tional regulatory networks (TRNs) induce cellular states 
and eventually define the phenotypes represents a major 
challenge facing systems biologists. So far, numerous 
models have been proposed to infer the transcriptional 
regulations by TFs including, ordinary differential equa- 
tions, (probabilistic) Boolean networks, Bayesian net- 
works, and information theory and association models, 



* Correspondence: yhuang@utsa.edu 

department of Electrical and Computer Engineering, University of Texas at 
San Antonio, San Antonio, Texas, USA 

Full list of author information is available at the end of the article 



etc [2]. Ideally, the TF protein level activities are needed 
for exact modeling; however, due to low protein cover- 
age and poor quantification accuracy of high throughput 
proteomics technologies such as protein array and liquid 
chromatography-mass spectrometry (LC-MS), the mea- 
surements of TF protein activities are currently hardly 
available. As a compromise, most of the aforementioned 
models conveniently yet inappropriately assume the TF's 
mRNA expression as its protein activity. Given the fact 
that gene mRNA expression and its protein abundance 
are poorly correlated [3,4], these models cannot accu- 
rately model the transcriptional ds-regulation or reveal 
at the best TF £rans-regulation. 

In contrast, works based on factor models [5-10] point 
to a natural and promising direction for modeling the 
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TF mediated regulations, where the microarray gene 
expression is modeled as a linear combination of 
unknown TF activities, and the loading matrix in this 
model indicates the strength and the type (up- or down- 
regulation) of regulation. However, due to distinct fea- 
tures of TF regulation, conventional FA model is not 
readily applicable. First, due to various reasons (normal 
and disease, cancer grade, subtypes, etc), the samples 
are usually not independent with each other but show 
some clustering effect; while in the existing FA models, 
factors are typically assumed independent, which, 
although true in many applications, is not a realistic 
assumption for TF medicated regulation. Secondly, since 
a TF only regulates a small subset of genes, the loading 
matrix should be sparse. While with constructions of TF 
regulation databases, such as TRANSFAC [11], the 
knowledge of TF regulated genes becomes increasingly 
available, and should be included in the model so as to 
boost signal-to-noise and improve performance [12]. 
The inclusion of prior information and sparsity con- 
straint naturally call for a Bayesian solution. As an 
added advantage, having this prior knowledge actually 
resolves the factor order ambiguity of the conventional 
factor analysis. Thirdly, as suggested in [13-15], the 
non-negative assumption on TF activities be imposed. 

In a response to these requirements of modeling TF 
mediated regulatory networks, we propose here a novel 
Bayesian non-negative factor model (BNFM). Different 
from conventional factor analysis models, BNFM con- 
sists of a sparse loading matrix and a set of correlated 
non-negative factors. The sparsity of the loading matrix 
is constrained by a sparse prior [16] that directly reflects 
our existing knowledge of TF regulation. That is if a 
gene is known to be regulated by a TF, then the prior 
probability that this regulation exists is high, and other- 
wise, very low due to the generic sparse nature of TF 
regulation (A TF only regulates a small number of genes 
in the whole genome). Because of clustering effect on 
the data samples, the factors in this BNFM model are 
considered to be correlated and modeled by a Dirichlet 
process mixture (DPM) prior [17]. DPM imposes a nat- 
ural non-parametric clustering effect [18] among sam- 
ples of the same TF and can automatically determine 
the optimal number of clusters. Moreover, since the 
activities of TFs are non-negative, they are assumed to 
follow a (non-negative) rectified Gaussian distribution 
[19]. Due to the complex nonlinear structure of the 
BNFM, the estimation of the model becomes analytically 
infeasible and highly complicated numerically. A Gibbs 
sampling solution is developed to infer all the relevant 
unknown variables. 



Method 

Bayesian non-negative factor model 

Let y n ge 1Z Gxl for n = 1, . . . , N represent the n-th micro- 
array mRNA expression profile of G genes under a speci- 
fic context. In practice, microarray data y n register the 
log2 scaled (fold change of) gene expression levels under 
the context of interest relative to a background often 
obtained as the average expression levels among a variety 
of contexts, such as different cell lines and tumors 
[20,21]. We assume that the expression level y n is due to 
the linear combination of scaled TF absolute protein acti- 
vites and modeled by the following factor model 

y n =Ax n +c + e n (1) 

where, 

x n - the n-th sample vector of the scaled activities of L 
TFs of interest. Particulary, the non-negativity of x n is 
modeled by applying the component-wise rectification 
(or cut) function cut to a vector pseudo factors s m such 
that the /-th element of x n is expressed as 

x hn = cut(s u ) = max(5 U , 0) (2) 

Since clustering effects may exist among samples, the 
samples should be correlated. Therefore, pseudo factors 
s n are modeled by a Dirichlet Process Mixture (DPM) of 
the Gaussian distributions as 

G ~ DP{a,NIG{jj 0 ,K 0 ,a 0 ,p 0 )); 

where, A/"(/i; n' G ln) re P resen ts the Gaussian distribu- 
tion with mean [i^ n and variance crf n , DP denotes the 
Dirichlet process, and NIG is short for the conjugate 
Normal-Inverse-Gamma (NIG) distribution. This DPM 
model implies a clustering effect on s n such that 

h,n\ywHy n ' G ly n ~ N{lH lYn ,°l rn ) ® 
and 

{Vl lYn >°i lYn )~ MIG{n 0 ,K 0 ,a 0 ,p 0 );Y n - GEM(a); (4) 

where, y n s Z represents the cluster label of the n-th 
sample and is governed by a discrete GEM distribution 
[17], which defines the stick breaking process with para- 
meter a; this implies that the elements of s n are corre- 
lated. Based on (2) and (3), we have 

Hn\rn^ir n ' G tr n ~ N R {fu lirn ,cjl irn ) (5) 
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where, j\f R denotes the rectified Gaussian distribu- 
tion [19]. Since {fA ly ,af r ) and y n are still defined in 
(4) by the DP, x n is hence modeled by the DPM of the 
rectified Gaussian distributions and the elements of x n 
are accordingly correlated. In contrast to the conven- 
tional mixture model, the DPM model enables the num- 
ber of clusters to be learnt adaptively from the data 
instead of being predefined. 

A- the G x L loading matrix, whose element a g j 
represents the regulatory coefficient of the g-th gene by 
the /-th TF. Since a TF is known to regulate only small 
set of genes, A should be sparse. In our model, the ele- 
ments of A are assumed to be independent and with the 
a priori distribution [16] 

Pi*g,i) = (! - n gl lM a g,i) + ng.iM{*g,i\0'°li) ( 6 ) 

where, n gt i is the a priori probability of a g j to be non- 
zero. For instance, if a TF regulates a total of 500 genes 
among the 20000 genes in the human genome, then n gt i 
is equal to 

n gil = 500/20000 = 0.025 

In most cases, n g j are likely to be smaller than 10%. In 
practice, databases such as TRANSFAC [11] and DBD 
[22] provide information of experimentally validated or 
predicted target genes of TFs, and this knowledge can 
be incorporated in the model by setting, for instance, n gt 
i = 0.9, if TF / is known to regulate gene g; or otherwise 
7T g j = 0.025. The variable <j^ x defines how much the 
target genes are loaded on the corresponding TF and 
with prior distribution cr^ ~ Inv-Gamma(a a , fi a ) . 

c- a vector of constant, which can be considered as 
the constant term retained when linearizing the general 
relationship y n = f(x n ) as y n = Ax n + c. It may also be 
interpreted as static response of gene transcriptional 
expressions. 

e n - the G x 1 white Gaussian noise vector character- 
ized by the covariance matrix X = diag(cjg G ) 
and with prior distribution <j^ g ~ Inv-Gamma(a n , f5 n ) . 

The overall graphical model is shown in Fig.l. 

Equivalent model for centralized observations 

To infer a factor model (1) more efficiently, the observa- 
tion mean is usually removed at the first stage to elimi- 
nate the effect of the constant term c, resulting the 
equivalent model for centralized observations y m where, 

N 

Yn = Yn ~ f*y an d ju y = ^^y n /N. Traditionally, since 

n=l 

the models typically assume zero mean for the factors, 
the equivalent model for centralized observations 



remains the same except that the constant term is elimi- 
nated, i.e., if y n = Ax n + c + e w then, for the centralized 
data y w 

y n =Ax n +e n (7) 

and fly can be viewed as an ML estimator of the con- 
stant term c [23]. For BNFM, however, since the factor 
mean is no longer zero, the equivalent model for BNFM 
no longer remains the same as above mentioned tradi- 
tional model, but instead, 

y n =Ax n +e n (8) 

where, 

X n = X n - JU X (9) 

N 

^=£x„/N (10) 

n=l 

Given sufficient number of samples, the sample mean 
Px = [fxs f*x 2 > ->f*x L ] J can be approximated with the 
mean of prior distribution (4) (5), which can be calcu- 
lated numerically. We can also see that the correspond- 
ing centralized factors are a shifted version of the 
original factors, and different samples shift the same 
amount, so sample clustering effect is still retained. On 
the other hand, the removed term from data centraliza- 
tion is no longer an estimator of the constant term c, 
but, 

V Y =Av x + c (11) 

The goal is to obtain the posterior distributions and 
hence the estimates of A, x n Vn, y^n, given y w Vn and n g j 
Vg, /, which is the TF binding prior information extracted 
from existing database. For convenience, we let 0 denote 
all the known and unknown variables. 

Gibbs sampling solution 

The proposed BNFM model is high dimensional and 
analytically intractable, so a Gibbs sampling solution is 
proposed. Gibbs sampling devises a Markov Chain 
Monte Carlo scheme to generate random samples of the 
unknowns from the desired but intractable posterior dis- 
tributions and then approximate the (marginal) poster- 
ior distributions with these samples. The key of Gibbs 
sampling is to derive the conditional posterior distribu- 
tions and then draw samples from them iteratively until 
convergence is reached. The proposed Gibbs sampler 
can be summarized as follows: 
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Figure 1 Graphical model. The Bayesian graphical model is shown here. y n is the observed mRNA gene microarray data, the prior probability 
of regulation n gJ is extracted from TRANSFAC database, a ni /3 n , A 0 , a, an, p a are the hyperparameters, and the rest variables are unknown and 
need to be estimated. 



Gibbs sampling for BNFA 

Iterate the following steps and for the £-th iteration: 

1. Sample cr^Vg from p{a 2 e ; |B_ j2 ); 

2. Sample cj?* t] Vg,l from p{o} {t \® e \ [t) ) ; 

3. Sample a^J\/g from p(a g j\®_ a J; 

4. for n = 1 to N 

Sample y W from p{Y n \®_ s ; B y ) ; 
Sample from P(xn|B_ s " V) ] 
Sample s ( n f) from p{s n \®_ s ); 

Note that fjL lk ,of^s/l,k are marginalized and there- 
fore does not need to be sampled. The algorithm iter- 
ates until the convergence of samples, which can be 
assessed by the scheme described in [24], [chap. 11.6]. 
The samples after convergence will be collected to 
approximate the marginal posterior distributions and 



the estimates of the unknowns. Since fi x can be approxi- 
mated and calculated numerically, the factor x n can be 
recovered from the centralized factor x n with (9). The 
required conditional distributions of the above proposed 
Gibbs sampling solution are detailed in the next. 

Conditional distributions of the proposed Gibbs sampling 
solution 

For simplicity, we let x w and y n denote the centralized 
factors and data in this section. 
Let E = Y - AX and = [e & \, e giN ] T , then, 
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where, y gJ = x t a g i + e^, x/ = [#/,!,..., x Ln \ T and = [e gtl , 
-> e gj*] T - The posterior distribution of a gh 

P{a g ,i\®- asl ,y liN ) 

= Z Q p{y gl \x }l a gJl al s )p{a gj} ) (12) 
= Z 0 [(l - n gil )M{y g Jx ; a^,a e y N )<5(fl s ) + ^^(^Jx^^^^InJ^^IO^^)] 
= (l-w 4 ,/)5(fl 4#l ) + w x ,//(fl 4(I ) 

where Z 0 is a normalizing constant, 
71 gl \ = 7i g i I [(1 - 7i g l )BF 01 + 7i g i\ is the posterior prob- 
ability of a g j * 0 and BF 0 i is the Bayes factor of model 
a g i - 0 versus model a g i * 0 



where denotes a new cluster other than the existing 
K,S_ n k = {i\i ^ n,y i =k} represents the set of the 
pseudo factors besides s t that also belong to cluster /<, 
N_i£ is size of S_ n k , and 

K L 

Z 0 =^N_ nik g k +ag- k ; g k =Y\gi,k> 



Si,k = $ 



with, 



-t*x-Vl,n,k 



- T - 2 



^01 = 



p(y j|x„ * o, trl) jo, c M(I ) 



2 ^2 ^2 '-p r p ^ 2 -\— 1 ^2 

cr = cri, n ,fe -oi,n,k&i (a z a z cr/ /n ,fe+X) ajcr^fe; 



with C M// = x l x l <j al + <j eg l N ; f{a gl ) is the poster- 
ior distribution for ^/ * 0 and defined by 

f{a gil ) = N{a gil \ii a Ql >°a g ,i) 



where, 



~ l u " , T - N k _ n Sl,k-n 

ai, n ,k = « 0 + ~ N k,-n> K ln,k = K Q + N ki _ n , jU lnk - 



^0 + Nfe,-n 



where, ^ = ( + ^ J" 1 ^ and ^^o^( J (vW^(W)i 



CT L = - °a,l x J ( x ^a,/ x ^ + ) 1 ' ^ d 

n g i is the prior knowledge of the probability of a gt to be 
non-zero. When n gt i - 0.5, i.e, a noninformative prior 

on sparsity is assumed, n gl i depends only on BF 01 , and 

n gl i < 0.5 when BF 01 > 1. Since model selection based 

BF 0 i favors a gt i - 0, it suggests that this Bayesian solu- 
tion favors sparse model even when n g j = 

0.5. P(7nl @ -x„, 7w ) 

It should be noted that y n does not depend on x n in 
the distribution. It is intended that samples of y n from 
this distribution are not affected by the immediate sam- 
ple of x w thus achieving faster convergence of the sam- 
ple Markov chains. To derive this distribution, first let 
fi >n - 2LiXi fH + e n with a/ being the /-th column of A and 
hence y ln ~ N{&iX l n ,Z) . Then, 

P(rn\®-x n ,y n >Yl:N) 

= P{Yn\Y-n'Yi:L / n) 

PiYn^nW-n'Yi.L^n 



Sl,k,-n = ^ S hi I N kr _ n , 



and, 



7, -7, ■ ~ 2 . - Pl,n,k (yufc + 1) . 

(a/, n ,fe-lj Kl, n ,k 

Noted that, for a new cluster, k = k, S_i k = (j) and N_ 
i t k - 0, and gr can be derived from g k for k = k 
similarly. Pi x l,n\®-x hn ) 

This distribution can be expressed as 

where, JV^^iuL-^^+oo) represents^ trun- 
cated Gaussian with parameters (ju lnk ,C7ink) an ^ 
between the interval (-^,+°°), and, 



=1; 

= — [p(yi:L,nl X nM x n^nl x -n^-n)^ x n 

1 

= y~ (2j N -n,kZk$irn - fe) + a«fe5(X/ - fe)) 



1 + 



(13) 



A/" 



^ X I^Xi „ u 



fe=l 



P(^UI Q -5 U ) 

According to the graphical model, given the con- 
ditional distribution of s iyn does not depend on y 1:N ; As 
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the predictive density |s_/ )W , yd is shown to be a 
Student-t distribution, which can be conveniently 
approximated as a normal distribution when N_ t>k is 
large: 

P(Sl,n) = N{ld hnk ,CJl,n,k) 

and conditional distribution can be expressed as 

where, n S[n = 1 - sgn (x hn + ^PfaeJ®-^ ) 
Let the residuals E = Y - AX, and we have, 
e g ~ A/"(0,cr^I N ), where = [e^, e^ 2 , e^] T Given 
the conjugate Inverse-Gamma prior, we have 

P(°e, g \®>Yi:N) = P(?lg\*g) = Inv-Gamma(a r p g ) (15) 

where Inv-Gamma represents the Inverse-Gamma dis- 
tribution and 

N 

a g =a Q + N/2; j5 g = P 0 +^e 2 g j2; 

n=l 

With the prior distribution 

of ~Inv-Gamma(a fl ,jSJ, the conditional probability 
of of is, 

of ~Inv-Gamma(a fl#/ ,j8 flf/ ) 
where, a aX = a a + \ N a>l and 

z A + iX c > and N a,i is the size of 

Sa,l={gW g ,l*0}- 

Results 

Test on simulated system 

The proposed BNFM model was first tested on a simu- 
lated system, in which the microarray data consists of 
the expression profiles of 150 genes with 40 samples. 
The samples form 5 clusters and the 150 genes were 
assumed to be regulated by 10 TFs. The sparsity of 
loading matrix was set at 10%, which means that on 
average each gene is regulated by 1 TFs, and each TF 
regulates 15 genes. To simulate a practical imperfect 
database, the precision and recall of the prior knowledge 
were both set equal to 0.9 each, i,e., 90% of the database 
recorded regulations indeed happened in this specific 
data set (10% of the database recorded regulations may 
be context-specific and didn't happen in the data); and 
90% of the true regulations was recorded in the database 
(10% of true regulations are not in the database). This 



setting indicates that the recorded prior regulations may 
not exist in the experiment, and the unknown regula- 
tions could exist. Since this is a relatively large data set 
involving sampling of many variables, instead of examin- 
ing convergence based on [24], [chap. 11.6], we adopted 
a more practical strategy by running a single MCMC 
chain for 10000 iterations with a burn-in period of 2000 
iterations [25]. 

Since the algorithm estimates the loading matrix, the 
factors, the clustering result, and TF regulatory targets, 
to evaluate the performance, four respective metrics 
were computed. Particularly, in order to systematically 
evaluate the clustering result, a Van Rijsbergen's F 
metric [26] that combines the BCubed precision and 
recall [27] was implemented as suggested in [28]. More 
specifically, let L(e) and C(e) be the category and the 
cluster of an item e. Then, the correctness of the rela- 
tion between e and e is defined by 

Jl \tiL{e) = L{e')^C{e) = C{e') 
Correctness^, e ) = < 

[ 0 otherwise 

That is, two items are correctly related when they 
share the same cluster. Moreover, the BCubed precision 
and recall are formally defined as 

Precision BCubed = Avg^fAvg^c^^c^'^Correctnessf^,/)]] 
Recall BCubed = Avg^tAvg^^^^^^'^Correctnessfg,/)]] 

These two metrics can be further combined using Van 
Rijsbergens F metrics: 



0.5/P + (l-0.5)/K R + P 

The F metrics satisfy all the 4 formal constraints 
defined in [28] including cluster homogeneity, cluster 
completeness, rag bag, and cluster size vs. quantity. We 
adopt the F metrics to evaluate the clustering result in 
the following tests. Similarly, a Van Rijsbergen's F 
metric that combines the target prediction precision and 
recall is used to measure the target prediction result. 
Since our model can avoid sign ambiguity problem, the 
loading and factor estimations were evaluated using its 
Pearson's correlation with their true values. 

Experiments were carried out to test the impact of 
noise (Fig. 2), database precision (Fig. 3) and database 
recall (Fig. 4) on the performance of the algorithm. It 
can be seen from the simulation result that, at the low 
noise level or with high quality prior database, the 
developed algorithm can produce satisfactory result. 
Expectedly, the performance of the algorithm decreases 
as the noise variance increases or database quality 
decreases. However, the clustering performance is more 
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(a) Clustering Result 
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(b) Factor Estimation Result 
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(c) Regulated Target Prediction Result 
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(d) Loadings Estimation Result 
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Figure 2 Impact of noise. The performance vs noise is shown here. Two noise conditions (cr ^ _ q 2 f 0.4) are tested. It can be seen from 
the figure that, the algorithm performance deceases as noise increases. While the clustering result relies on noise heavily (Fig.2-a), the target 
prediction is relatively more robust against noise (Fig.2-c). 



sensitive to noise (Fig.2), while target prediction result 
relies more heavily on the quality of database prior 
knowledge (Fig. 3-4), because database directly support 
regulation posterior probability through its prior prob- 
ability. In summary, the simulation results are indicative 
of satisfactory performance of the developed Gibbs sam- 
pling algorithm. 

Test on breast cancer data 

After validating the performance of the proposed algo- 
rithm by simulation, the algorithm was then applied to the 
breast cancer microarray data published in [29-32]. Parti- 
cularly, we applied the algorithm to 53 samples of grade 3 
ER + breast cancer. All samples came with gene microarray 
expression, ER status and survival time information. For 
the settings of the algorithm, we first manually selected a 
total of 15 TFs that are reported to be relevant to breast 
cancer (Table 1) and then retrieved a total of 199 



regulated target genes (Table 2) by these TFs from 
TRANSFAC database [11] (Release 2009.4). We assume 
that TRANSFAC record has a 90% precision and 90% 
recall, suggesting that the known regulations may be con- 
text-specific and unknown regulations could exist. From 
the precision and the recall, the prior probability of the 
loading matrix can be determined. Based on these settings, 
the proposed approach was applied to the breast cancer 
data set to infer the underlying regulatory networks and 
TF activities. The posterior distribution of the loading 
matrix (Fig.5) gives insight into the sparsity of inferred TF 
mediated regulation. It can be seen that the posterior 
probability of regulations fall into 2 distinct groups, i.e., 
one group has very small posterior probabilities, which 
correspond to regulations that do not exist; while the 
other group have larger posterior probabilities, which cor- 
respond the regulations that are likely to exist. Fig. 6 
depicts possible regulations and their posterior 
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(b) Factor Estimation Result 
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Figure 3 Impact of database precision. The performance vs database precision is shown here. It can be seen from the figure that, the 
algorithm performance deceases as noise increases. While the target prediction relies on the quality of database heavily (Fig.3-c), the clustering 
result is relatively robust against database quality (Fig.3-a) . 



probabilities (rounded by 0.1) in a network, demonstrating 
the capability of the proposed approaches to identify possi- 
ble TF regulated target genes. 

When setting the cut-off threshold 0.5, the result con- 
firmed 281 regulations among the 282 regulations that 
were defined in the TRANS F AC database, and identified 
25 new regulations that are not recorded in the database. 
This fact demonstrates the ability of our algorithm to dis- 
cover new regulations and discern context-dependent 
regulations among the prior knowledge, and the recon- 
structed network is shown in Fig. 7, showing the capabil- 
ity of the proposed approach to identify both the 
strength (represented by edge width) and the type (repre- 
sented by edge color) of transcriptional regulations. 

Along with the estimates of regulatory coefficients, the 
transcription factor activities and the sample cluster attri- 
butes were also obtained. Fig.8 depicts the estimated TF 
activities, with the patient samples grouped according to 



the clustering result, and it clearly shows the coordinated 
clustering effects. To further gain insights into the clini- 
cal outcomes of different patient groups defined by the 
TF activities, survival analysis was carried out and it con- 
firmed the survival difference between the the 1st and 
2nd clusters (p = 0.05) as shown in Fig.9. Previous studies 
based on expression levels [33-36] identified 5 major sub- 
types (luminal A, luminal B, basal, ERBB2 overexpressing, 
and normal-like). We compared the pair-wise survival 
difference between our clustering (3 clusters) and pre- 
vious result (5 clusters). It shows the superior perfor- 
mance of our method (Table 3) over the previous 
computation based method (Table 4). 

Discussion 

We proposed a new approach to uncover the transcrip- 
tional regulatory networks from microarray gene expres- 
sion profiles. We discuss next a few distinct features of it. 
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(b) Factor Estimation Result 
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(d) Loadings Estimation Result 
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Figure 4 Impact of database recall. The performance vs database recall is shown here. It can be seen from the figure that, the algorithm 
performance deceases as noise increases. While the target prediction relies on the quality of database heavily (Fig.4-c), the clustering result is 
relatively robust against database quality (Fig.4-a) . 



Table 1 List of tested 15 TFs and aliases 





Name 


Target 


Aliases 


1 


EBP-ce 


21 


BPc; C/EBP; C/EBP alpha; C/EBPalpha; CBP; 


2 


ETS-1 


14 


c-Ets-1; c-Ets-1 54; c-Ets-1A; Ets 1 ; p54; p54c-Ets-1. 


3 


FOS 


23 


c-Fos; FBJ osteosarcoma oncogene; p55(c-fos). 


4 


MYC 


11 


c-Myc; MYC; v-myc myelocytomatosis viral oncogene homolog (avian). 


5 


CREB 


22 


ATF-47; CREB; CREB-341; CREB-A; CREB-isoforml; CREB1; 


6 


ATF-2 


16 


activating transcription factor 2; ATF2; CRE-BP1; CREB2; CREBP1; 


7 


EGR-1 


24 


AT225; early growth response protein 1; 


8 


EBP-/? 


23 


AGP/EBP; ANF-2; C/EBP beta; C/EBP-beta; C/EBPbeta; CEBPB; 


9 


NF-kB 


28 


NFkappaB; Nuclear Factor kappa B. 


10 


P53 


20 


ASp53; LFS1; NSp53; p53; p53as; RSp53; tp53; TRP53; 


11 


ATM 


14 


activating transcription factor 1; ATF1; EWS-ATF1; FUS/ATF-1; 


12 


STAT-3 


12 


acute-phase response factor; APRF; 


13 


STAT-1 


19 


signal transducer and activator of transcription 1. 


14 


AP-2 


16 


activating enhancer binding protein 2 alpha; activator protein-2; 


15 


CREB-1 


19 


ATF-47; CREB; CREB1; cyclic AMP response element-binding protein; 



The tested 15 transcription factors and their aliases. 
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Table 2 List of tested 199 genes 





Jy IIIDLH 




jy ihdui 




jy IIIDLM 




jy IIIDUI 


1 


LALK4 


51 


rno") 

LUoz 


1 01 




151 


LUKN I A 


2 


TAT 

LA I 


52 


Ul A P\DA 


1 02 


Dl K /\ 1 

rllvl I 


1 52 


r 1 1 U 1 


3 


FOS 


53 


VIP 


1 03 


COL1 A2 


1 53 


MITF 


4 


MT2A 


54 


INS 


1 04 


IL2RB 


1 54 


HBB 


5 


DC \ A DD 


55 


PTGS2 


1 05 


ZNhzoo 


1 55 


CSF1 


6 


DBH 


56 


APOA2 


1 06 


GSN 


1 56 


TIMP1 


7 


bhKrINL I 


57 


FGFR2 


1 07 


I NhKbh IUL 


1 57 


F9 


8 


CHEK1 


58 


CCND1 


1 08 


CXCL3 


1 58 


VHL 


9 


SCN3B 


59 


CASP1 


1 09 


LbNKZD 


1 59 


CD1 A 


1 0 


F7 


60 


HBB 


1 1 0 


TRA@ 


1 60 


SFN 


1 1 


ITGAX 


61 


COL2A1 


1 1 1 


1— II A P\DD 1 

HLA-DrD I 


1 61 


SOAT1 


1 2 


EIF4E 


62 


IVlUIVlz 


1 1 2 


TRA@ 


1 62 


FCGR1 A 


1 3 


TGFB2 


63 


RB1 


1 1 3 


TP53 


1 63 


FAS 


14 


CDC25A 


64 


NDRG1 


1 14 


SOX9 


1 64 


HBG1 


1 5 


IL3 


65 


BRCA1 


1 1 5 


A I AVC A D 

ALOX5AP 


1 65 


WARS 


1 6 




66 


BAX 


1 1 6 


TOP1 


1 66 


KIKbDL 1 


1 7 


IL1 0 


67 


ATF2 


1 1 7 


NFKB1 


1 67 


CD8A 


1 8 


F3 


68 


FN1 


1 1 8 


IL2 


1 68 


IL6 


1 9 


IL2RA 


69 


BCL2L1 


1 1 9 


SLC9A3 


1 69 


TWIST1 


20 


BDNF 


70 


CCR5 


1 20 


r~\zno A A 


1 70 


CXCL1 


21 


WEE1 


71 


TF 


121 


CRH 


1 71 


IFNB1 


22 


CYPl I Al 


72 


TFRC 


1 22 


CIITA 


1 72 


PTK2 


23 


MD/I A ~) 

N K4A2 


73 


HD 


1 23 


KhVVUz 


1 73 


SPP1 


24 


\ /U I 

VHL 


74 


rvri 1 

LXLL I 


1 24 


i r\£> 
LUK 


1 74 


Lbh I 


25 


TRH 


75 


LbNKl Al 


1 25 


REN 


1 75 


TP73 


26 


S0D2 


76 


NR3C1 


1 26 


YBX1 


1 76 


CD53 


27 


CSF2RA 


77 


bPINK I 


1 27 


ATF3 


1 77 


NAB2 


28 


MUC1 


78 


EGR1 


1 28 


TEAD1 


1 78 


PTTG1 


29 


IVlbrV 


79 


LUN I 


1 29 




1 79 


IL1 B 


30 


GNAI2 


80 


TFAP2A 


1 30 


APAF1 


1 80 


APOB 


31 


DRD1 


81 


CFTR 


131 


LYPl 9Al 


181 


IL8 


32 


ADRB2 


82 


MYC 


1 32 


ACE 


1 82 


TAF7 


33 


GCLC 


83 


FMR1 


1 33 


KRT1 6 


1 83 


DTD A A 1 

r I P4A I 


34 


UrnlVI I 


84 


CO 

ho 


1 34 


NUbzA 


1 84 


UCni "7DO 

HbU I /bo 


35 


IFNG 


85 


TSC22D3 


1 35 


FXR2 


1 85 


ABCB1 


36 


BCL2A1 


86 


FGF2 


1 36 


IRF1 


1 86 


PBK 


37 


CCL5 


87 


LOR 


1 37 


CGA 


1 87 


TACR1 


38 


ICAM1 


88 


PTHLH 


1 38 


KRT14 


1 88 


IVIAUd 


39 


DC l~N ITK I 

PSLNLN 


89 


bl 00A9 


1 39 


ABCA2 


1 89 


RPL1 0 


40 


IER2 


90 


GAUU45A 


140 


FGA 


1 90 


IVL 


41 


S0D1 


91 


EX01 


141 


TALD01 


1 91 


ERBB2 


42 


GNRHR 


92 


PLAU 


142 


CSF1 


192 


CCL2 


43 


LTA 


93 


PTH 


143 


SFTPD 


193 


BBC3 


44 


TERT 


94 


CDK4 


144 


CRP 


194 


TP63 


45 


TNFAIP6 


95 


PPARG 


145 


TPT1 


195 


RFWD2 


46 


0DC1 


96 


POLB 


146 


SLC9A2 


196 


FGFR4 


47 


LTF 


97 


ID1 


147 


CYP2A13 


197 


NAT1 


48 


PRLR 


98 


MT2A 


148 


DDX18 


198 


SELE 


49 


TNF 


99 


SST 


149 


CCNA2 


199 


FASLG 


50 


MMP1 


100 


KRT14 


150 


IL6ST 







The tested 199 genes. 
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Posterior Probability 

Figure 5 Histogram of the regulation posterior probabilities. 

The figure shows the histogram of posterior probabilities of all the 

possible regulations. Two distinct groups can be identified, each 

representing a group of regulations that are likely to happen (p > 

0.5) or not (p < 0.5). 
v / 

First, to reflect the fact that a TF only regulates a 
small number of genes among the whole genome, the 
loading matrix of the factor model is constrained by a 
sparse prior [16], which directly reflects our existing 
knowledge of the particular TF-gene regulation, i.e., if 
the regulation exists according to prior knowledge, the 
probability of the corresponding component of the load- 
ing matrix to be non-zero is large; or otherwise, very 
small. The introduction of sparsity significantly con- 
strains the factor model and helps to enable the infer- 
ence of a set of correlated samples. 

Second, since the activities of TFs cannot be negative, 
the factors are modeled by a non-negative rectified 




Figure 6 Network of the posterior probability of regulations. 

The network depicts possible regulations and their posterior 
probabilities (rounded by 0.1), where a edge indicates a possible 
regulation from a TF to a gene, and the line width of the edge 
represents the probability of the particular regulation exists, with 
thicker line width stands for larger probability; and vise versa. 

V J 
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Figure 7 Network of the regulation coefficients. The network 
shows regulation coefficients, where an edge between a TF and a 
gene indicates the gene is regulated by the TF, and the color of the 
edge indicates the regulation types (red for up-regulation, and blue 
for down regulations), and the line width stands for the regulation 
strength. 



Gaussian distribution [19], which not only is consistent 
with the physical fact of TF regulation but also avoids 
the inherent sign ambiguity problem of the factor mod- 
els. Noted that, a rectified Gaussian distribution j\f R is 
different from a truncated Gaussian j\f T in that: 



p(x = 0) = 



0 ifx~ M t {ia,g 2 ) 

0(-az/ct) if am N R {n,a 2 ) 



indicating that the rectified Gaussian model can also 
describe the possible suppressed state of TFs, which 




c/) 
c 
o 
'■*—> 
o 

M— 

03 

> 
(/) 

"O 

CD 
03 
E 

(I) 
LU 
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Survival Months 

Figure 9 Survival difference between cluster 1 and 2. The 

survivals of the two estimated clusters show statistical difference p 
= 0.04 when using logrank test, indicating the two clusters are 
potentially corresponding to two subtypes of breast cancer that 
have different survival time. 



nevertheless cannot be modeled by the truncated Gaus- 
sian distribution. A comparison of Gaussian, rectified 
Gaussian, and truncated Gaussian is shown as Fig. 10. In 
our model, the non-negativity is constrained only on the 
factor matrix; the elements of loading matrix can be 
either positive or negative, which models the corre- 
sponding up- or down-regulation of TFs. This is 




EBP-alpha 



r-r-T- csicsicsicsi Csi rococo rococo co co **tm lO c^c^roro^^t^t^t^i^ 
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Figure 8 Estimated transcription factor expression. The tested samples fall into 3 major clusters with 24, 1 1 and 1 1 samples. The rest 7 
samples may be considered as outliers that are not classified. In accordance with the sample clustering result, the recovered TF shows 3 major 
clustering patterns with a few outliers. 
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Table 3 Survival test of clustering results 





Cluster 1 


Cluster 2 


Cluster 3 


Cluster 1 


N/A 


0.04 


0.16 


Cluster 2 


0.04 


N/A 


0.93 


Cluster 3 


0.16 


0.93 


N/A 



The logrank test result of the survival difference between each pair of 
estimated clusters (min= 0.04). The survivals of cluster 1 and cluster 2 are 
significantly different with p = 0.04, indicating the two predicted clusters may 
each represent a subtype of breast cancer. 



Table 4 Survival test of previous results 





luminal A 


luminal B 


Basal-like 


HER2+/ER- 


normal-like 


luminal A 


N/A 


0.75 


0.76 


0.42 


0.83 


luminal B 


0.75 


N/A 


0.98 


0.7 


0.8 


Basal-like 


0.76 


0.98 


N/A 


0.67 


0.94 


HER2+/ER- 


0.42 


0.7 


0.67 


N/A 


0.46 


normal-like 


0.83 


0.8 


0.94 


0.46 


N/A 



The logrank test result of the survival difference between each pair of 
previous predicted clusters [33-36]. None of the pair shows statistical 
difference (min= 0.42). 



different from non-negative matrix factorization (NMF) 
[13,15,37,38]. NMF enforces that both the loading 
matrix and the factor matrix must be non-negative, i.e., 
all elements must be equal to or greater than zero. With 
the capability of modeling both the up- or down-regula- 
tions, the proposed BNFM is more appropriate for mod- 
eling the TF regulation than NMF. 

To model the samples correlation due to, for instance, 
cancer subtypes, the samples are modeled by a Dirichlet 
process mixture (DPM), which imposes clustering effect 
among samples and can automatically determine the 
optimal number of clusters from data rather than be pre- 
defined in the algorithm. Forth, other types of data, such 
as ChlP-chip data [39-41] and DNA methylation data 
[42] can be conveniently integrated with gene expression 



data [43] under the proposed framework by setting a 
slightly different prior probabilities to the loading matrix. 
Integrating additional data types can potentially improve 
the accuracy of the reconstructed networks. [12]. 

However, the proposed model is not without short- 
comings. First, this model can not yet capture regula- 
tions from TFs that are not specified in the prior 
knowledge database. In reality, it is possible that some 
TFs that are not specified in the prior knowledge actu- 
ally regulate the gene transcription. Second, the algo- 
rithm may not converge in a reasonable number of 
iterations on a large data set, thus cannot yet be applied 
to genome wide data set. Because the model parameters 
are high dimensional and highly correlated, the speed of 
convergence may significantly slow down on a large 
data set [44,45]. However, such restriction on the size of 
genes and TFs forces us to focus the analysis on most 
relevant genes and TFs, making the analysis more tar- 
geted and easy to interpret. We demonstrate in section 
Result, how such analysis can be carried out starting 
from a whole genome microarray data. With the 
advancement in ChlP-seq technology and increasing 
knowledge of TFs biological functions, the proposed 
model could be applied for a genome-wide study in the 
future. 

Thirdly, the prior knowledge may still need to be 
properly evaluated. If the prior knowledge is considered 
an estimation of the true TRN, when the precision p, 
recall r of prior information and the sparsity of the load- 
ing matrix 5 is given, the prior probability of the £-th 
gene to be a target of the /-th TF n gtl can be calculated 
as follows: 

[ p recorded regulation 

g ' 1 [ 5p(l - r) / (p - sr) not recorded regulation 




Figure 10 Comparison of original, rectified and truncated Gaussian distributions. The probability distribution function of Gaussian, rectified 
Gaussian and truncated Gaussian are shown in this figure. The range of Gaussian distribution A/*(0, l 2 ) is f rom ( _00 » °°)» the range of rectified 
Gaussian A/^O,! 2 ) is [0, °°], and the range of truncated Gaussian A/" T (0, l 2 , 0, 3) is (0, 3). 
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However, the precision or recall of the prior knowl- 
edge database are only arbitrarily specified (both 90%). 
In practice, the quality of prior knowledge should be 
evaluated first before more reasonable prior probabilities 
of regulations can be assigned. 

Conclusions 

A Bayesian factor model that has sparse loading 
matrix, non-negative factors, and correlated samples, 
was proposed to unveil the latent activities of tran- 
scription factors and their targeted genes from 
observed gene mRNA expression profiles. By naturally 
incorporating the prior knowledge of TF regulated 
genes, the sparsity constraint of the loading matrix, the 
non-negativity constraints of TF activities, the regula- 
tion coefficients and TF activities can be estimated. A 
Gibbs sampling solution was proposed and model 
inference. The effectiveness and validity of the model 
and the proposed Gibbs sampler were evaluated on 
simulated systems. The proposed method was applied 
to the breast cancer microarray data and a TF regu- 
lated network for breast cancer data was reconstructed. 
The inferred TF activities indicates 3 patients clusters, 
two of which possess significant differences in survival 
time after treatment. These results demonstrated that 
the BNFM provides a viable approach to reconstruct 
TF mediated regulatory networks and estimate TF 
activities from mRNA expression profiles. The BNFM 
will be an important tool for not only understanding 
the transcriptional regulation but also predicting the 
clinical outcomes of treatment. 
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